Developing linguistic theories using annotated corpora
Abstract
This paper aims to carve out a place for corpus research within theoretical linguistics and psycholinguistics. We argue that annotated corpora naturally complement native speaker intuitions and controlled psycholinguistic methods and thus can be powerful tools for developing and evaluating linguistic theories. We also review basic methods and best practices for moving from corpus annotations to hypothesis formation and testing, offering practical advice and technical guidance to researchers wishing to incorporate corpus methods into their work.
Similar resources
Ccls-13-02
The Linguistic Data Consortium (LDC) has developed hundreds of data corpora for natural language processing (NLP) research. Among these are a number of annotated treebank corpora for Arabic. Typically, these corpora consist of a single collection of annotated documents. NLP research, however, usually requires multiple data sets for the purposes of training models, developing techniques, and fin...
Developing Morphologically Annotated Corpora for Minority Languages of Russia
Despite recent progress in developing annotated corpora for minority languages of Russia, only about a dozen of roughly 100 have comprehensive corpora, and even fewer have computational tools such as machine translation systems or speech recognition modules. However, given that many of them have resources such as dictionaries and grammars, the situation can be improved at relatively low ...
Parallel Aligned Treebank Corpora at LDC: Methodology, Annotation and Integration
The interest in syntactically-annotated data for improving machine translation quality has spurred the growing demand for parallel aligned treebank data. To meet this demand, the Linguistic Data Consortium (LDC) has created large volume, multi-lingual and multi-level aligned treebank corpora by aligning and integrating existing treebank annotation resources. Such corpora are more useful when th...
Developing a Deep Linguistic Databank Supporting a Collection of Treebanks: the CINTIL DeepGramBank
Corpora of sentences annotated with grammatical information have been deployed by extending the basic lexical and morphological data with increasingly complex information, such as phrase constituency, syntactic functions, semantic roles, etc. As these corpora grow in size and the linguistic information to be encoded reaches higher levels of sophistication, the utilization of annotation tools an...
Syntactically annotated corpora of Estonian
Syntactically annotated corpora are needed 1) to train and test parsers and various language technology products (grammar checkers, information retrievers and extractors, machine translators, etc.); 2) to check the agreement of existing linguistic theories with real language usage. The corpora can be annotated at different levels of depth. In shallow syntactically annotated corpora a syntact...